Indexing

Types of Indexes
Types of Merges
Indexing-Related Performance Counters
Failure Handling
Property Cache

Storing the words and properties extracted by the CiDaemon process in indexes is referred to as indexing. An index is a special data structure that is used to satisfy queries efficiently.

As documents in the corpus (Web site) are modified, the indexing program is notified of the updates and those documents enter a change queue. The CiDaemon process retrieves documents from the change queue in “first-in-first-out” order and filters them. The resulting words and properties are then added to the index. If there are many documents waiting in the change queue, there will be a delay before the index has up-to-date information about those documents.
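The flow from change queue to index can be pictured with a minimal sketch. The names ChangeQueue, filter_document, and drain are hypothetical and are not part of Index Server; filtering is reduced to splitting text into lowercase words, and the index is a plain dictionary.

```python
from collections import deque


class ChangeQueue:
    """Hypothetical FIFO queue of modified document paths."""

    def __init__(self):
        self._pending = deque()

    def notify(self, path):
        # Called when a document in the corpus is modified.
        self._pending.append(path)

    def next_document(self):
        # Oldest notification first; None when the queue is empty.
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


def filter_document(text):
    """Stand-in for filtering: split the text into lowercase words."""
    return [word.lower() for word in text.split()]


def drain(queue, corpus, index):
    """Filter queued documents in arrival order and add their words to index."""
    while (path := queue.next_document()) is not None:
        for word in filter_document(corpus[path]):
            index.setdefault(word, set()).add(path)
```

The longer the queue, the later a given document reaches the index, which is the delay described above.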

Types of Indexes

There are three types of indexes: word lists, shadow indexes, and a master index. Words and properties extracted from a document first appear in a word list, then move to a shadow index, and finally move to the master index. This organization is optimized for query responsiveness and performance. It also ensures optimal resource usage. Even though there are multiple indexes internally, these details are completely hidden from the user. The user sees only a list of documents that satisfy the query that was posted.

Word Lists

Word lists are small, in-memory indexes. Each word list contains data for a small number of documents. As soon as a document is filtered, its data is stored in a word list. Creating a word list is very quick and does not require updating any on-disk data. A word list serves as a temporary staging area during indexing.
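As an illustration only, a word list can be thought of as a small in-memory mapping from words to the documents that contain them. The class below is a hypothetical sketch; the actual in-memory layout used by Index Server is not described here.

```python
from collections import defaultdict


class WordList:
    """Hypothetical in-memory index covering a handful of documents."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it
        self.doc_ids = set()

    def add_document(self, doc_id, words):
        # Only memory is touched; nothing is written to disk.
        self.doc_ids.add(doc_id)
        for word in words:
            self.postings[word].add(doc_id)

    def lookup(self, word):
        # Sorted ids of the documents that contain the word.
        return sorted(self.postings.get(word, ()))
```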

There are several registry parameters that control word list behavior. All the keys are under the registry path \HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Content Index. The following table shows the registry parameters and explanations.

Parameter	Explanation
MaxWordLists	Maximum number of word lists. If the number exceeds this, a shadow merge will be performed.
MaxWordlistSize	Maximum recommended size of a single word list. If the size of a word list exceeds this value, a new one will be created. This is an internal value and must not be changed.
MinSizeMergeWordlists	If the combined size of all word lists exceeds this number, a shadow merge will be performed.

Once the number of word lists exceeds the value of the MaxWordLists parameter, the word lists are merged into a shadow index. This merge process is called a shadow merge. Although the data in word lists is compressed to some extent, the compression is not very high because word lists are temporary structures. Because word lists are in-memory structures, documents in a word list must be refiltered whenever IIS is restarted. The need for refiltering is detected automatically, and the refiltering is performed by the Microsoft Index Server engine.

Persistent Index

When data for an index is stored on disk, it is called a persistent index. Unlike word lists, which are in-memory indexes, a persistent index survives shutdowns and restarts. Persistent-index data is stored in a highly compressed format. There are two types of persistent indexes:

Shadow Index

A shadow index is a persistent index created by merging word lists and sometimes other shadow indexes into a single index. There can be multiple shadow indexes in the catalog.

Master Index

A master index is a persistent index that contains the indexed data for a large number of documents. This is usually the largest persistent data structure. In an ideal state, this is the only index present, because all the indexed data is stored in the master index and there are no shadow indexes or word lists. The data is highly compressed.

A master index is created by a master merge, which merges all the shadow indexes and the current master index (if any) into a new master index. After the master merge, all the source indexes are deleted and only the new master index remains. In this state, queries are resolved most efficiently.

The total number of persistent indexes (shadow indexes and master index) in a catalog cannot exceed 255.

Types of Merges

The process of combining data from multiple indexes into a single index is called merging. Merging removes some redundant data and frees up resources. Queries are also resolved faster with fewer indexes. There are three types of merges: shadow merge, master merge, and annealing merge.

After a merge completes, the multiple source indexes are replaced by a single target index.
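The following is a rough sketch of merging several source indexes into one target index. It assumes, purely for illustration, that the redundant data consists of entries for documents that a newer source index has since re-indexed; the dictionary layout and the function name merge_indexes are hypothetical.

```python
def merge_indexes(sources):
    """Merge source indexes, given oldest first, into one target index.

    Each index is a dict of the form
    {"docs": set of doc ids, "postings": {word: set of doc ids}}.
    """
    target = {"docs": set(), "postings": {}}
    superseded = set()  # documents already taken from a newer index
    for index in reversed(sources):  # walk newest first
        fresh = index["docs"] - superseded
        for word, doc_ids in index["postings"].items():
            kept = doc_ids & fresh  # drop entries for superseded documents
            if kept:
                target["postings"].setdefault(word, set()).update(kept)
        target["docs"] |= fresh
        superseded |= index["docs"]
    return target
```

In this sketch, a document that appears in both an old and a new source index keeps only the entries from the new one, so the target index is smaller than the sum of its sources.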

Shadow Merge

Combining multiple word lists and shadow indexes into a single shadow index is called a shadow merge. A shadow merge is performed to free up memory used by word lists and also to make the filtered data persistent; it is usually a quick operation.

In the most common case, the source indexes for a shadow merge are word lists. However, if the total number of shadow indexes exceeds MaxIndexes, some of the shadow indexes are also used as source indexes. Shadow indexes are also used as source indexes during an annealing merge.

A shadow merge is triggered by one of the following conditions:

The number of word lists exceeds MaxWordLists.
The combined size of all word lists exceeds MinSizeMergeWordlists.
As a precursor to a master merge. Before starting a master merge, a shadow merge is performed to merge all existing word lists into a shadow index.
For an annealing merge.
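The four conditions above can be summarized in a small decision function. This is a sketch under stated assumptions: the function name and its arguments are hypothetical, the word list sizes are assumed to be in the same unit as MinSizeMergeWordlists, and the order in which the conditions are checked is arbitrary.

```python
def shadow_merge_reason(word_list_sizes, max_word_lists, min_size_merge_wordlists,
                        master_merge_pending=False, annealing_merge_due=False):
    """Return the reason a shadow merge should start, or None."""
    if len(word_list_sizes) > max_word_lists:
        return "word list count exceeds MaxWordLists"
    if sum(word_list_sizes) > min_size_merge_wordlists:
        return "combined size exceeds MinSizeMergeWordlists"
    if master_merge_pending:
        return "precursor to master merge"
    if annealing_merge_due:
        return "annealing merge"
    return None
```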

Master Merge

For a master merge, the source indexes are all of the existing shadow indexes and the current master index (if any). At the end of a master merge, all the source indexes are replaced by a single target master index. Although the master merge itself is very resource-intensive, in both CPU and disk space, resources are freed once it completes: much of the redundant data is deleted and queries run faster.

Depending upon the size of the source indexes, a master merge can be a very long-running operation. However, it is fully restartable after failures and shutdowns. A master merge will continue from where it left off.

Whenever a master merge is started, restarted, or paused, an event is written to the event log. A master merge can be started for any of the following reasons.

Nightly maintenance master merge. This can be done at a specified time every day. The registry value MasterMergeTime is the number of minutes after midnight when the merge should happen. By default, the nightly master merge happens at midnight. This value should be adjusted to reflect the time when the load on the server is lowest.
When the number of changed documents since the last master merge exceeds MaxFreshCount, a master merge is performed to reduce the number of changed documents. A high number of changed documents increases memory usage. A master merge reduces the FreshCount to zero.
When the disk space remaining on the catalog drive is less than MinDiskFreeForceMerge and the cumulative space occupied by shadow indexes exceeds MaxShadowFreeForceMerge, a master merge is started to combine the shadow indexes and free up disk space.
When the total disk space occupied by shadow indexes exceeds MaxShadowIndexSize, a master merge is started to combine the shadow indexes. This condition has higher precedence than the previous condition.
Finally, a master merge can be forced by an administrator by using the administrative Web pages. Because queries run faster after a master merge completes, the administrator may want to force a merge before one of the preceding conditions triggers it.
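The reasons listed above can be sketched as a single decision function. Everything except the registry parameter names is an assumption made for illustration: the function names, the state dictionary and its keys, the units (each state value is assumed to use the same unit as the parameter it is compared with), and the minute-granularity check for the nightly merge. The only ordering taken from the text is that MaxShadowIndexSize is checked before the low-disk-space condition.

```python
from datetime import time


def merge_time_of_day(master_merge_time):
    """Convert MasterMergeTime (minutes after midnight) to a time of day."""
    hours, minutes = divmod(master_merge_time % (24 * 60), 60)
    return time(hours, minutes)


def master_merge_reason(state, params, forced=False):
    """Return the reason a master merge should start, or None."""
    if forced:
        return "forced by administrator"
    # Checked before the low-disk condition, which it takes precedence over.
    if state["shadow_size"] > params["MaxShadowIndexSize"]:
        return "shadow indexes exceed MaxShadowIndexSize"
    if (state["disk_free"] < params["MinDiskFreeForceMerge"]
            and state["shadow_size"] > params["MaxShadowFreeForceMerge"]):
        return "low disk space"
    if state["fresh_count"] > params["MaxFreshCount"]:
        return "fresh count exceeds MaxFreshCount"
    if state["minutes_after_midnight"] == params["MasterMergeTime"]:
        return "nightly maintenance"
    return None
```

For example, merge_time_of_day(90) corresponds to 1:30 A.M., and a MasterMergeTime of 0 corresponds to midnight, the default.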

Annealing Merge

An annealing merge is a special kind of shadow merge performed when the system is idle for a certain length of time and the total number of persistent indexes exceeds MaxIdealIndexes. The registry parameter MinMergeIdleTime specifies the percentage of CPU time that must be idle during a time period to trigger an annealing merge. An annealing merge improves query performance and disk space usage by reducing the number of shadow indexes.
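The annealing merge condition can be expressed as a simple predicate. This sketch assumes that idleness is measured by averaging CPU idle percentages sampled over the time period and that reaching MinMergeIdleTime exactly counts as idle enough; the function and argument names are hypothetical.

```python
def annealing_merge_due(cpu_idle_samples, persistent_index_count,
                        min_merge_idle_time, max_ideal_indexes):
    """True if the system was idle enough and has too many persistent indexes."""
    if not cpu_idle_samples:
        return False  # no measurements, so idleness cannot be established
    average_idle = sum(cpu_idle_samples) / len(cpu_idle_samples)
    return (average_idle >= min_merge_idle_time
            and persistent_index_count > max_ideal_indexes)
```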

Indexing-Related Performance Counters

The following performance counters are related to indexing and merging. They are all present under the Content Index object.

Counter Name	Explanation
Index Size	Total size of all the persistent indexes in megabytes.
Persistent Indexes	Total number of persistent indexes.
Merge Progress	Percentage of merge completed.
Word lists	Total number of word lists.

Failure Handling

There can be several kinds of failures during filtering, indexing, and merging. Except after a hardware failure, indexes are fully recoverable. Indexing and merging are temporarily paused if memory load is very high. Failed operations are retried later.

Disk Full Condition

A merge is not started if the free disk space is very low on the catalog drive. However, the drive may still run out of free space while a merge is in progress. In that case, a shadow merge is aborted and retried after disk space is freed up. A master merge is not ended but temporarily paused. An event is written to the Windows NT event log stating that the master merge is paused. A disk full event is also written. The administrator should free up disk space by moving or deleting data files from the corpus. Do not delete any files under the Index Server catalog directory. The system automatically detects when enough disk space has been freed and restarts the operations.
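The disk-full behavior can be sketched as follows. The threshold min_free_bytes, the function names, and the returned action strings are assumptions for illustration; the text does not state how low free space must be before a merge is refused. The free-space check uses the standard library call shutil.disk_usage.

```python
import shutil


def enough_space_to_start(catalog_dir, min_free_bytes):
    """True if the drive holding catalog_dir has at least min_free_bytes free."""
    return shutil.disk_usage(catalog_dir).free >= min_free_bytes


def on_disk_full(merge_kind):
    """Return (action, events to log) when a running merge hits a full disk."""
    events = ["disk full"]
    if merge_kind == "shadow":
        # Aborted now, retried once disk space has been freed.
        return "abort", events
    if merge_kind == "master":
        # Not ended: paused, and resumed once disk space has been freed.
        events.append("master merge paused")
        return "pause", events
    raise ValueError(f"unknown merge kind: {merge_kind!r}")
```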

Property Cache

The property cache is an on-disk store optimized to speed up the retrieval of frequently retrieved values such as Path, Abstract, Title, Attributes, Last Write time stamp, File Size, and some values for internal use only. In a future release, administrators will be able to configure the property cache for storing custom properties.

The property cache is also a large data structure, its size being comparable to that of the master index. The registry parameter PropertyStoreMappedCache controls how much of the property cache is always kept in memory. On large index servers, setting this value higher will yield better performance. If the physical memory is not adequate, the performance might suffer.

Failure Recovery

During a dirty shutdown, the property cache may become corrupted. During startup after a dirty shutdown, a consistency check is performed on the property cache, and any problems that are detected are fixed. However, some irreparable inconsistencies may occur. In that case, all existing index data is discarded and the documents are automatically re-indexed. See the Error Detection and Recovery page for more information.

Events are written to the event log when a recovery operation is performed on the property cache.